Before I begin plotting the data, I want to first figure out a couple of things about the variables. First, how many of each quality are there?
summary(wf)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.878
## 3rd Qu.:6.000
## Max. :9.000
table(wf$quality)
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
There are 4898 observations of 12 variables. Each observation (or row) consists of 11 variables describing various chemical and physical aspects of a wine, plus the median of the ratings given to that wine by judges, with 0 being the lowest rating and 10 the highest.
Quality is the feature of interest - the goal of this analysis is to explore what other features of the data explain the quality of the wines.
From what I have read in the readme for the dataset, I am expecting levels of sulfur dioxide to play a part in determining quality - it seems like there ought to be a balance of sulfur dioxide. Too much will cause a bad, sulfurous odor, while too little may make the wine not fresh. Beyond that, my non-existent knowledge of wine would have me expect that sugar levels, alcohol content, and salt content would all have some sort of effect on quality, though in what way I really have no idea at this point. I also expect acidity to be a factor in determining quality.
As you can see, there are no wines with ratings of 0, 1, 2, or 10. There are only 5 wines with ratings of 9 and 20 with ratings of 3. This seems like a good indication that I can group some of these ratings together into buckets: “high”, “medium-high”, “medium”, “medium-low”, and “low”. I’ll do this by adding a new variable: quality_level. This will let me use geom_freqpoly and facet_wrap more effectively, since I won’t have one category with only 5 observations in it and another category with over 2000. Low is 3 and 4, medium-low is 5, medium is 6, medium-high is 7, and high is 8 and 9.
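The bucketing described above can be sketched with base R's `cut()`. This is only an illustration: the `quality` vector below is a made-up stand-in for `wf$quality`, and the exact code used in the original analysis is not shown in this report.

```r
# Toy stand-in for wf$quality (assumption: the real column holds integers 3-9).
quality <- c(3, 4, 5, 5, 6, 6, 6, 7, 8, 9)

# cut() intervals are right-closed by default:
# (2,4] -> low, (4,5] -> medium-low, (5,6] -> medium,
# (6,7] -> medium-high, (7,9] -> high
quality_level <- cut(
  quality,
  breaks = c(2, 4, 5, 6, 7, 9),
  labels = c("low", "medium-low", "medium", "medium-high", "high"),
  ordered_result = TRUE
)

# Count how many wines fall into each bucket.
table(quality_level)
```

Applied to the counts in the quality table above, this grouping would give 183 low, 1457 medium-low, 2198 medium, 880 medium-high, and 180 high wines.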
This distribution is somewhat normal, though there are several hundred more medium-low wines than medium-high wines. Still, I think this will serve as a suitable replacement for quality in terms of plotting.
## Source: local data frame [5 x 12]
##
## quality_level median_fixed_acidity median_volatile_acidity
## 1 low 6.9 0.32
## 2 medium-low 6.8 0.28
## 3 medium 6.8 0.25
## 4 medium-high 6.7 0.25
## 5 high 6.8 0.26
## Variables not shown: median_citric_acid (dbl), median_total_acidity (dbl),
## median_alcohol (dbl), median_sugar (dbl), median_ph (dbl),
## median_chlorides (dbl), median_total_so2 (dbl), median_free_so2 (dbl),
## median_sulphates (dbl)
Here, I have done some aggregations on the various features. I have also created a new variable called total acidity, which is the sum of citric, fixed, and volatile acidity.
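For readers who want to reproduce this kind of summary, here is a minimal sketch using base R's `aggregate()`. The output above looks like it came from a different tool, so treat this as an equivalent approach rather than the code actually used; `wf_toy` and its values are invented for illustration.

```r
# Toy stand-in for the wine data frame; all values are made up.
wf_toy <- data.frame(
  quality_level    = c("low", "low", "medium", "medium", "high", "high"),
  fixed.acidity    = c(7.0, 6.8, 6.9, 6.7, 6.6, 7.0),
  volatile.acidity = c(0.30, 0.34, 0.24, 0.26, 0.25, 0.27),
  citric.acid      = c(0.30, 0.28, 0.32, 0.34, 0.36, 0.30)
)

# Total acidity as described in the text: citric + fixed + volatile.
wf_toy$total.acidity <- with(
  wf_toy, citric.acid + fixed.acidity + volatile.acidity
)

# Median of every remaining column within each quality level.
medians <- aggregate(. ~ quality_level, data = wf_toy, FUN = median)
medians
```

With a character grouping column the rows come back in alphabetical order; converting `quality_level` to an ordered factor first would keep the low-to-high ordering.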
##
## Pearson's product-moment correlation
##
## data: wf$fixed.acidity and wf$volatile.acidity
## t = -1.5886, df = 4896, p-value = 0.1122
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.050671536 0.005312543
## sample estimates:
## cor
## -0.02269729
##
## Pearson's product-moment correlation
##
## data: wf$citric.acid and wf$fixed.acidity
## t = 21.137, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2633067 0.3146389
## sample estimates:
## cor
## 0.2891807
##
## Pearson's product-moment correlation
##
## data: wf$citric.acid and wf$volatile.acidity
## t = -10.578, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1767384 -0.1219760
## sample estimates:
## cor
## -0.1494718
##
## Pearson's product-moment correlation
##
## data: wf$total.acidity and wf$pH
## t = -33.388, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4531918 -0.4075605
## sample estimates:
## cor
## -0.4306513
The correlation between fixed and volatile acidity is pretty small and not statistically significant (p = 0.11), but there is a correlation between citric acid and fixed acidity of 0.29. This is positive, unlike the correlation between citric acid and volatile acidity of -0.15. I am not sure why that is, but it seems interesting that wines with more citric acid tend to have higher fixed acidity but lower volatile acidity.
As expected, total acidity is negatively correlated with pH - more acids (obviously) mean lower pH.
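The test outputs in this report have the shape produced by R's `cor.test()`, which runs a Pearson test by default. Below is a small self-contained sketch of such a call; the two vectors are simulated stand-ins (not the wine data), built so that the second decreases with the first.

```r
set.seed(1)
n <- 200

# Simulated stand-ins for total acidity and pH; the numbers are invented.
total_acidity <- rnorm(n, mean = 7.5, sd = 0.9)
pH <- 3.8 - 0.08 * total_acidity + rnorm(n, sd = 0.12)

# Pearson's product-moment correlation test (the default method).
result <- cor.test(total_acidity, pH)

result$estimate   # sample correlation coefficient
result$conf.int   # 95 percent confidence interval
result$p.value    # p-value for H0: true correlation equals 0
```

Pulling the pieces out of the returned object like this makes it easier to collect several correlations into one table instead of reading through each printed block.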
##
## Pearson's product-moment correlation
##
## data: wf$alcohol and wf$residual.sugar
## t = -35.321, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4726723 -0.4280267
## sample estimates:
## cor
## -0.4506312
What seems interesting here is that while plotting sugar on its own against quality does not show much of a correlation, plotting residual sugar against alcohol and then coloring by quality seems to show that higher quality wines, which tend to have higher alcohol contents, also tend to have lower sugar levels than wines with lower alcohol contents. Plotting sugar together with alcohol content made the relationship of both features to quality much easier to see.
The correlation between alcohol and sugar is -0.45, which is moderately strong. As alcohol increases, sugar levels tend to decrease, which confirms what we see in our plot. Perhaps this is a result of yeast consuming more sugar to produce more ethanol during fermentation.
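As a sanity check on the test output, the t statistic of a Pearson test can be recomputed from the correlation and the sample size with the standard formula t = r * sqrt(n - 2) / sqrt(1 - r^2). The sketch below plugs in only the numbers reported above.

```r
r  <- -0.4506312   # reported correlation between alcohol and residual sugar
n  <- 4898         # number of observations
df <- n - 2        # degrees of freedom reported by the test (4896)

# Standard t statistic for testing H0: true correlation equals 0.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)
t_stat             # about -35.32, in line with the reported t

# Squared correlation: share of variance the two features have in common.
r^2                # about 0.20
```

The squared correlation is a useful companion number: it says that roughly a fifth of the variance in one feature is linearly accounted for by the other.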
##
## Pearson's product-moment correlation
##
## data: wf$residual.sugar and wf$density
## t = 107.87, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8304732 0.8470698
## sample estimates:
## cor
## 0.8389665
##
## Pearson's product-moment correlation
##
## data: wf$alcohol and wf$density
## t = -87.255, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7908646 -0.7689315
## sample estimates:
## cor
## -0.7801376
##
## Pearson's product-moment correlation
##
## data: wf$chlorides and wf$density
## t = 18.624, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2308679 0.2831779
## sample estimates:
## cor
## 0.2572113
As expected, both alcohol content and residual sugar are highly correlated with density. If we were to create a linear regression model for quality, we should avoid having all three of these variables in the model, as multicollinearity would become a significant problem. Salt content is also correlated with density, though to a lesser extent than the other two features.
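One common way to quantify this concern is the variance inflation factor (VIF): regress each predictor on the others and compute 1 / (1 - R^2). The sketch below does this by hand in base R on simulated data. The data frame `toy` and its coefficients are invented so that density depends on sugar and alcohol, and they are not estimates from the wine data.

```r
set.seed(42)
n <- 500

# Simulated predictors (made-up numbers): density is built to depend
# positively on sugar and negatively on alcohol, plus a little noise.
alcohol        <- rnorm(n, mean = 10.5, sd = 1.2)
residual.sugar <- pmax(0.6, 6.4 - 1.9 * (alcohol - 10.5) + rnorm(n, sd = 4))
density        <- 0.994 + 0.0004 * (residual.sugar - 6.4) -
  0.0012 * (alcohol - 10.5) + rnorm(n, sd = 0.0003)
toy <- data.frame(alcohol, residual.sugar, density)

# VIF for each predictor: regress it on the remaining predictors.
vif <- sapply(names(toy), function(v) {
  others <- setdiff(names(toy), v)
  fit <- lm(reformulate(others, response = v), data = toy)
  1 / (1 - summary(fit)$r.squared)
})
vif
```

Values near 1 indicate little overlap with the other predictors. Cutoffs such as 5 or 10 are often quoted as warning signs, but they are conventions rather than strict rules.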
##
## Pearson's product-moment correlation
##
## data: wf$chlorides and wf$residual.sugar
## t = 6.2299, df = 4896, p-value = 5.057e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.06082916 0.11640188
## sample estimates:
## cor
## 0.08868454
##
## Pearson's product-moment correlation
##
## data: wf$chlorides and wf$alcohol
## t = -27.016, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3843183 -0.3355673
## sample estimates:
## cor
## -0.3601887
What is interesting is the inverse correlation between alcohol and chlorides, which I would not have expected. It seems that there are no wines with low alcohol content and low chloride levels and no wines with high alcohol content and high chloride levels. I am not sure why that is - perhaps it is a side effect of making wine with high alcohol content, or that high quality wines are produced with the goal of high alcohol content and low salt content in mind. Regardless, they are correlated, so we should bear that in mind while constructing a model so as to keep multicollinearity at a minimum.
##
## Pearson's product-moment correlation
##
## data: wf$sulphates and wf$total.sulfur.dioxide
## t = 9.5019, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1069590 0.1619585
## sample estimates:
## cor
## 0.1345624
##
## Pearson's product-moment correlation
##
## data: wf$sulphates and wf$free.sulfur.dioxide
## t = 4.1508, df = 4896, p-value = 3.369e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.03126264 0.08707928
## sample estimates:
## cor
## 0.05921725
##
## Pearson's product-moment correlation
##
## data: wf$free.sulfur.dioxide and wf$total.sulfur.dioxide
## t = 54.645, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5977994 0.6326026
## sample estimates:
## cor
## 0.615501
The level of sulphates in a wine does not seem to be very closely related to the amount of sulfur dioxide both in gaseous and dissolved form, though this is expected because the readme for the dataset says that sulphate level contributes only a small amount to sulfur dioxide.
As expected, total sulfur dioxide and free sulfur dioxide are pretty strongly correlated.
First, let us talk about the features other than the feature of interest that are correlated with each other. Some were obvious and expected, others are not:
- alcohol and density
- sugar and density
- chlorides and alcohol
- total sulfur dioxide and free sulfur dioxide
- all the various acidities with pH

Now, let us list how these features relate to quality:
- higher alcohol and higher quality
- lower sugar and higher quality
- lower total sulfur dioxide and higher quality
- lower acidity and higher quality
- lower salt content and higher quality
With this information, we can improve on our expectations of what makes for a high quality wine. Good wines tend to have higher alcohol contents, fruitier flavor (due to higher citric acid content), lower sugar levels, lower salt levels, lower sulfur dioxide levels, and lower overall acidity. I have left out features such as density, which is too strongly correlated with more important features such as alcohol content and residual sugar, and sulphates, which does not seem to be correlated with quality and is only very slightly correlated with total sulfur dioxide.
##
## Calls:
## m1: lm(formula = alcohol ~ quality, data = wf)
## m2: lm(formula = alcohol ~ quality + volatile.acidity, data = wf)
## m3: lm(formula = alcohol ~ quality + volatile.acidity + chlorides,
## data = wf)
## m4: lm(formula = alcohol ~ quality + volatile.acidity + chlorides +
## residual.sugar, data = wf)
## m5: lm(formula = alcohol ~ quality + volatile.acidity + chlorides +
## residual.sugar + total.sulfur.dioxide, data = wf)
##
## ============================================================================
## m1 m2 m3 m4 m5
## ----------------------------------------------------------------------------
## (Intercept) 6.957*** 6.166*** 7.351*** 8.085*** 8.963***
## (0.106) (0.123) (0.127) (0.114) (0.117)
## quality 0.605*** 0.648*** 0.567*** 0.526*** 0.493***
## (0.018) (0.018) (0.017) (0.015) (0.015)
## volatile.acidity 1.936*** 2.043*** 2.264*** 2.365***
## (0.158) (0.150) (0.132) (0.127)
## chlorides -16.128*** -14.541*** -12.681***
## (0.693) (0.612) (0.594)
## residual.sugar -0.098*** -0.076***
## (0.003) (0.003)
## total.sulfur.dioxide -0.007***
## (0.000)
## ----------------------------------------------------------------------------
## R-squared 0.190 0.214 0.292 0.452 0.495
## adj. R-squared 0.190 0.214 0.292 0.451 0.495
## sigma 1.108 1.091 1.036 0.912 0.875
## F 1146.395 666.007 673.471 1008.012 959.723
## p 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -7450.661 -7376.455 -7119.517 -6493.901 -6291.857
## Deviance 6009.118 5829.768 5249.127 4065.773 3743.809
## AIC 14907.323 14760.910 14249.034 12999.801 12597.714
## BIC 14926.812 14786.896 14281.517 13038.781 12643.190
## N 4898 4898 4898 4898 4898
## ============================================================================
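In the end, using just a pretty basic linear model, we get an R-squared of 0.495 for m5, which is not too shabby. Note that, as the calls above show, these models use alcohol as the response and quality as one of the predictors, so they describe how quality and the other features relate to alcohol content rather than predicting quality directly. Of course, this is far from a perfect model - a linear regression simply cannot capture all the subtleties of the data. I also already know that chlorides and alcohol are correlated with each other; since alcohol is the response here, that correlation is part of what the model fits, while any correlation among the predictors themselves (multicollinearity) makes the individual coefficient estimates less reliable.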
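The step-by-step model building in the table above can be sketched in base R with `lm()` and `update()`. The data here are simulated (`toy` and its coefficients are invented), and the comparison uses `summary()`, `anova()`, and `AIC()` rather than whatever produced the side-by-side table in this report.

```r
set.seed(7)
n <- 300

# Simulated stand-in data; the coefficients are arbitrary illustrations.
quality   <- sample(3:9, n, replace = TRUE)
chlorides <- runif(n, min = 0.01, max = 0.10)
alcohol   <- 7 + 0.5 * quality - 12 * chlorides + rnorm(n, sd = 0.9)
toy <- data.frame(alcohol, quality, chlorides)

# Start with one predictor, then add another to the same model.
m1 <- lm(alcohol ~ quality, data = toy)
m2 <- update(m1, . ~ . + chlorides)

# Compare (unadjusted) R-squared across the nested models.
r2 <- sapply(list(m1 = m1, m2 = m2), function(m) summary(m)$r.squared)
r2

# F-test for the added term, and AIC for each model.
anova(m1, m2)
AIC(m1, m2)
```

Unadjusted R-squared can never go down when a predictor is added to a nested least-squares model, so the F-test, adjusted R-squared, or AIC are better guides to whether an extra term is worth keeping.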
This plot shows a clear trend towards higher alcohol content as wine quality increases. Higher alcohol content seems to be strongly associated with wine quality - the entire boxplot moves up for each increase in quality level, which is not something I would have expected. It really makes me wonder why exactly alcohol is so strongly correlated with wine quality, and whether that bears out in real life. This plot sparked much of the exploration in regard to whether other features were strongly correlated with alcohol - is higher alcohol content a result of a general higher-quality wine making process, or is it purposefully sought after in the wine making process? I spent much of my time trying to explore this angle in this report.
I selected these first two plots because they reveal quirks of the data that you wouldn’t have been able to see otherwise. During my EDA, it was hard to see whether sugar content related at all to wine quality - different levels of sugar content seemed to be distributed quite evenly across all wine qualities. However, this plot immediately reveals two things: (1) higher quality wines do, in fact, tend to have lower sugar levels, and (2) there are no high alcohol and high sugar content wines. The insights this plot offered me meant I now was willing to use residual sugar as a feature in the linear regression model I hoped to build, since it was clearly correlated with wine quality. And this paid off - adding residual.sugar to my linear model raised the R-squared value (unadjusted) from 0.292 to 0.452.
This reveals a relationship between features that I had not expected at all. For some reason, alcohol content seems to be inversely correlated with salt content - and high quality wines are overwhelmingly concentrated in the area of the plot where salt content is low and alcohol percentage is high.
This project was intimidating at first because there were so many features. Which ones should I concentrate on? Which ones would actually have any effect? And once I plotted the distributions of each with regard to quality, I did not come out as enlightened as I had thought I would be - only alcohol and perhaps salt seemed to contribute in any way to wine quality. This was unlike the diamond data set, in which features were fewer and there were universally defined metrics for what made a better diamond.
Still, there were a few common sense hunches that I had regarding what would affect wine quality - I feel that oftentimes, our own intuition is where we begin in such investigations, and in the process of confirming or invalidating those intuitions, we discover new quirks and trends that would not have occurred to us without such exploration. That is what happened with me - I felt that sugar levels and acidity ought to have some significant effect on wine quality.
After trying various plots with sugar levels, I was about ready to give up. There seemed to be no rhyme nor reason with sugar content across different quality wines. However, when I finally plotted sugar vs. alcohol content and colored the points by quality, sugar’s inverse relationship with wine quality finally revealed itself. Needless to say, I was pleased. However, this plot also revealed sugar’s inverse correlation with alcohol, which made me wonder why exactly would there be a relationship between alcohol and sugar? Is it because of the fermentation process that converts sugar into ethanol, and therefore the higher the alcohol content, the lower the sugar level?
This induced me to investigate further the relationship between alcohol and other features, and I found, to my surprise, that chlorides and alcohol were also inversely correlated. Wines with high alcohol content and low salt content were generally rated higher than wines with low alcohol content and high salt content. In fact, there seemed to be a relative dearth of wines that had both high alcohol and high salt contents as well as both low alcohol and low salt contents. This raises the same question that the discovery of sugar’s relationship with alcohol evoked: was this a result of the wine making process that naturally meant high quality wines had high alcohol contents and low salt contents, or was this due to wine makers purposely choosing to make wines with these characteristics? I do not think this is a question that can be answered with EDA alone - it would require an understanding of the wine making process as well.
Once I got the ball rolling in mixing and matching features to see if anything strange and interesting popped out, it was a relatively straightforward process to see how acidity related to wine quality. Strangely enough, it turned out that higher citric acid was correlated with higher wine quality even though overall acidity (as measured by pH and my total acidity variable) was correlated with lower wine quality. I attributed this to higher citric acid levels making wines taste fruitier. Also, total acidity was dominated by fixed acidity - citric acid was a small enough component of total acidity that its level was nearly negligible in determining pH, so this was actually a finding that made sense. Too acidic a wine probably tastes bad, but fruitier wine tastes better.
There are still many things that could be done. There are some combinations of features that I have not plotted - namely, sulphates and sulfur dioxide levels against density - and I have not checked whether they would change anything in my analysis. Perhaps using more boxplots would also reveal some interesting things.
Also, if I were to spend more time on this, I would likely create more robust models for predicting wine quality - using naive Bayes, or vector models, or a logistic regression. My linear model had decent results, but there is clearly room for a better model.
I would like to actually compare white wines with red wines - there would probably be a lot of interesting insights into the character of these two wines, in terms of their various acidities, alcohol contents, sugar levels, etc., and what makes for a high quality red or white wine.